MEPS Data

Data head
PANEL REGION AGE31X GENDER RACE3 MARRY31X EDRECODE FTSTU31X ACTDTY31 HONRDC31 ... PCS42 MCS42 K6SUM42 PHQ242 EMPST31 POVCAT15 INSCOV15 INCOME_M HEALTHEXP PERSONWT
0 19 2 52 0.0 0.0 5 13 -1 2 2 ... 25.93 58.47 3 0 4 1 2 11390.0 46612 21854.981705
1 19 2 55 1.0 0.0 3 14 -1 2 2 ... 20.42 26.57 17 6 4 3 2 11390.0 9207 18169.604822
2 19 2 22 1.0 0.0 5 13 3 2 2 ... 53.12 50.33 7 0 1 2 2 18000.0 808 17191.832515
3 19 2 2 0.0 0.0 6 -1 -1 3 3 ... -1.00 -1.00 -1 -1 -1 2 2 385.0 2721 20261.485463
4 19 3 25 1.0 0.0 1 14 -1 2 2 ... 59.89 45.91 9 2 1 3 1 3700.0 1573 7620.222014
5 19 3 48 0.0 0.0 1 16 -1 2 2 ... -1.00 -1.00 -1 -1 1 5 1 85000.0 432 13019.508635
6 19 4 31 0.0 1.0 5 14 -1 2 2 ... 56.71 62.39 0 0 1 3 1 24000.0 413 3018.208554
7 19 1 37 0.0 0.0 1 15 -1 2 2 ... 42.91 58.76 0 0 1 5 1 56052.0 693 18017.598727
8 19 1 35 1.0 0.0 1 16 -1 2 2 ... 54.30 43.43 4 1 1 5 1 56052.0 5692 17508.950341
9 19 1 5 0.0 0.0 6 1 -1 3 3 ... -1.00 -1.00 -1 -1 -1 5 1 0.0 301 18158.487104

10 rows × 46 columns

Data description

count    18350.000000
mean        19.529264
std          0.499156
min         19.000000
25%         19.000000
50%         20.000000
75%         20.000000
max         20.000000
Name: PANEL, dtype: float64 

count    18350.000000
mean         2.607466
std          0.942848
min          1.000000
25%          2.000000
50%          3.000000
75%          3.000000
max          4.000000
Name: REGION, dtype: float64 

count    18350.000000
mean        38.746649
std         23.020492
min          0.000000
25%         19.000000
50%         38.500000
75%         57.000000
max         85.000000
Name: AGE31X, dtype: float64 

count    18350.000000
mean         0.521526
std          0.499550
min          0.000000
25%          0.000000
50%          1.000000
75%          1.000000
max          1.000000
Name: GENDER, dtype: float64 

count    18350.000000
mean         0.338147
std          0.473092
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          1.000000
Name: RACE3, dtype: float64 

count    18350.000000
mean         3.590954
std          2.262703
min          1.000000
25%          1.000000
50%          5.000000
75%          5.000000
max         10.000000
Name: MARRY31X, dtype: float64 

count    18350.000000
mean         9.842943
std          6.226279
min         -1.000000
25%          2.000000
50%         13.000000
75%         14.000000
max         16.000000
Name: EDRECODE, dtype: float64 

count    18350.000000
mean        -0.759619
std          0.855099
min         -1.000000
25%         -1.000000
50%         -1.000000
75%         -1.000000
max          3.000000
Name: FTSTU31X, dtype: float64 

count    18350.000000
mean         2.638692
std          0.813550
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          4.000000
Name: ACTDTY31, dtype: float64 

count    18350.000000
mean         2.156948
std          0.523573
min          1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          4.000000
Name: HONRDC31, dtype: float64 

count    18350.000000
mean         2.177929
std          1.095924
min         -1.000000
25%          1.000000
50%          2.000000
75%          3.000000
max          5.000000
Name: RTHLTH31, dtype: float64 

count    18350.000000
mean         1.932316
std          1.021325
min         -1.000000
25%          1.000000
50%          2.000000
75%          3.000000
max          5.000000
Name: MNHLTH31, dtype: float64 

count    18350.000000
mean         1.031771
std          1.176932
min         -1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          2.000000
Name: HIBPDX, dtype: float64 

count    18350.000000
mean         1.277384
std          1.246940
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: CHDDX, dtype: float64 

count    18350.000000
mean         1.302125
std          1.251105
min         -1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: ANGIDX, dtype: float64 

count    18350.000000
mean         1.288447
std          1.248865
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: MIDX, dtype: float64 

count    18350.000000
mean         1.223215
std          1.236044
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: OHRTDX, dtype: float64 

count    18350.000000
mean         1.284687
std          1.248222
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: STRKDX, dtype: float64 

count    18350.000000
mean         1.303215
std          1.251277
min         -1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: EMPHDX, dtype: float64 

count    18350.000000
mean         1.275368
std          1.267648
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: CHBRON31, dtype: float64 

count    18350.000000
mean         1.078093
std          1.194322
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: CHOLDX, dtype: float64 

count    18350.000000
mean         1.233297
std          1.238259
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: CANCERDX, dtype: float64 

count    18350.000000
mean         1.236785
std          1.239005
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: DIABDX, dtype: float64 

count    18350.000000
mean         0.977657
std          1.176663
min         -1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          2.000000
Name: JTPAIN31, dtype: float64 

count    18350.000000
mean         1.084687
std          1.196631
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: ARTHDX, dtype: float64 

count    18350.000000
mean        -0.208174
std          1.460262
min         -1.000000
25%         -1.000000
50%         -1.000000
75%         -1.000000
max          3.000000
Name: ARTHTYPE, dtype: float64 

count    18350.000000
mean         1.883106
std          0.330335
min         -1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: ASTHDX, dtype: float64 

count    18350.000000
mean        -0.451989
std          1.136285
min         -1.000000
25%         -1.000000
50%         -1.000000
75%         -1.000000
max          2.000000
Name: ADHDADDX, dtype: float64 

count    18350.00000
mean        -0.44812
std          1.15316
min         -1.00000
25%         -1.00000
50%         -1.00000
75%         -1.00000
max          2.00000
Name: PREGNT31, dtype: float64 

count    18350.000000
mean         1.865014
std          0.355783
min         -1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: WLKLIM31, dtype: float64 

count    18350.000000
mean         1.716349
std          0.750265
min         -1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: ACTLIM31, dtype: float64 

count    18350.000000
mean         1.934877
std          0.265885
min         -1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: SOCLIM31, dtype: float64 

count    18350.000000
mean         1.241744
std          1.261227
min         -1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: COGLIM31, dtype: float64 

count    18350.000000
mean         1.921308
std          0.364688
min         -1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: DFHEAR42, dtype: float64 

count    18350.000000
mean         1.937602
std          0.344966
min         -1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: DFSEE42, dtype: float64 

count    18350.000000
mean         0.900926
std          1.355978
min         -1.000000
25%         -1.000000
50%          2.000000
75%          2.000000
max          2.000000
Name: ADSMOK42, dtype: float64 

count    18350.000000
mean        32.348196
std         25.021569
min         -9.000000
25%         -1.000000
50%         43.310000
75%         55.090000
max         71.060000
Name: PCS42, dtype: float64 

count    18350.000000
mean        34.368819
std         25.934907
min         -9.000000
25%         -1.000000
50%         46.735000
75%         57.060000
max         74.980000
Name: MCS42, dtype: float64 

count    18350.000000
mean         1.664687
std          4.106635
min         -9.000000
25%         -1.000000
50%          0.000000
75%          3.000000
max         24.000000
Name: K6SUM42, dtype: float64 

count    18350.000000
mean         0.136948
std          1.329289
min         -1.000000
25%         -1.000000
50%          0.000000
75%          0.000000
max          6.000000
Name: PHQ242, dtype: float64 

count    18350.000000
mean         1.526376
std          1.842521
min         -1.000000
25%          1.000000
50%          1.000000
75%          4.000000
max          4.000000
Name: EMPST31, dtype: float64 

count    18350.000000
mean         3.510627
std          1.461804
min          1.000000
25%          3.000000
50%          4.000000
75%          5.000000
max          5.000000
Name: POVCAT15, dtype: float64 

count    18350.000000
mean         1.446921
std          0.624748
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          3.000000
Name: INSCOV15, dtype: float64 

count     18350.000000
mean      27853.695313
std       36225.013969
min           0.000000
25%           0.000000
50%       16200.000000
75%       40000.000000
max      320299.000000
Name: INCOME_M, dtype: float64 

count     18350.000000
mean       5184.511608
std       15126.748532
min           0.000000
25%         198.000000
50%        1034.000000
75%        4219.500000
max      659952.000000
Name: HEALTHEXP, dtype: float64 

count    18350.000000
mean     11991.877066
std       9405.954874
min          0.000000
25%       5470.448863
50%       9867.351501
75%      15662.855181
max      98103.984953
Name: PERSONWT, dtype: float64 

The data has a long tail, hence the logarithmic (base 3) transformation of the explained variable (HEALTHEXP).
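A minimal sketch of such a transformation is given below. The shift of one before taking the logarithm is an assumption (it keeps zero expenses finite and maps them to zero); the notebook does not show the exact formula, and the helper names are hypothetical.

```python
import math

def log3_transform(expense):
    """Map a non-negative expense to log base 3 of (expense + 1)."""
    # The +1 shift is an assumption; it keeps zero expenses finite.
    return math.log(expense + 1, 3)

def inverse_log3(value):
    """Undo log3_transform and return the expense in original units."""
    return 3 ** value - 1

# Sample HEALTHEXP values taken from the data head above.
expenses = [0, 808, 46612]
transformed = [log3_transform(e) for e in expenses]
print(transformed)
```

The inverse helper is useful when a prediction on the log scale has to be reported in original units.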

Categorical variables:
 ['HIBPDX', 'CHDDX', 'ANGIDX', 'MIDX', 'OHRTDX', 'STRKDX', 'EMPHDX', 'CHBRON31', 'CHOLDX', 'CANCERDX', 'DIABDX', 'JTPAIN31', 'ARTHDX', 'ASTHDX', 'ADHDADDX', 'PREGNT31', 'WLKLIM31', 'ACTLIM31', 'SOCLIM31', 'COGLIM31', 'DFHEAR42', 'DFSEE42', 'ADSMOK42']
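One common way to feed such coded columns to a linear model is one-hot encoding. The sketch below is only an illustration: the notebook does not show how the categorical variables were encoded, the function name is hypothetical, and the code set (-1, 1, 2) is assumed from the minimum, quartile and maximum values in the summary statistics above.

```python
def one_hot(values, categories):
    """Return one indicator row per value, one column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Codes assumed for a diagnosis column such as HIBPDX.
codes = [-1, 1, 2]

# Four made-up observations, not taken from the MEPS data.
rows = one_hot([1, 2, -1, 2], codes)
print(rows)
```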

Model - XGB and Linear

XGB   results:
training rmse: 1.977698616711458
training r2: 0.48684795508177536
training mae: 1.4688560597283682
test rmse: 2.1665237944616744
test r2: 0.37313414322890304
test mae: 1.6149617997637638

Lasso Regression results:
training rmse: 2.530774417382313
training r2: 0.15970318823228213
training mae: 1.9013310836499353
test rmse: 2.4984493904113023
test r2: 0.1663403093243384
test mae: 1.862761172732874
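
For reference, the three metrics reported above can be computed as in the following sketch. It uses the standard definitions of RMSE, R2 and MAE; the function name and the toy values are illustrative and not taken from the notebook.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, R2 and MAE for two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
    }

# Toy values, not taken from the MEPS data.
metrics = regression_metrics([0.0, 2.0, 4.0], [1.0, 2.0, 3.0])
print(metrics)
```

Because the target is on a log scale, these errors are also on that scale and are not directly comparable to costs in original units.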

Explaining model

Ceteris Paribus explanations for XGB

Preparation of a new explainer is initiated

  -> data              : 14680 rows 44 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 14680 values
  -> model_class       : xgboost.sklearn.XGBRegressor (default)
  -> label             : MEPS
  -> predict function  : <function yhat_default at 0x7f9148a5e670> will be used (default)
  -> predicted values  : min = -0.8554313, mean = 5.7083073, max = 10.244568
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -7.963533878326416, mean = 0.0019564959693102344, max = 6.395156118316233
  -> model_info        : package xgboost

A new explainer has been created!
Patien no:  3639  , prediction value:  6.5051827  true value:  [6.50884896] 


Calculating ceteris paribus!: 100%|██████████| 44/44 [00:00<00:00, 148.77it/s]

3

Patient 3639 has a relatively high prediction because he has joint pain (JTPAIN31), an asthma diagnosis (ASTHDX), low overall ratings of feelings (PHQ242), high income (POVCAT15) and private insurance coverage (INSCOV15). The lack of other positive diagnoses decreases the prediction.

Patien no:  975  , prediction value:  0.00083711743  true value:  [0.] 


Calculating ceteris paribus!: 100%|██████████| 44/44 [00:00<00:00, 47.70it/s]

Patient number 975 has a near-zero cost prediction since he has not been diagnosed with any disease. Each positive diagnosis would increase his predicted cost.

Patien no:  896  , prediction value:  8.826893  true value:  [8.82324185] 


Calculating ceteris paribus!: 100%|██████████| 44/44 [00:00<00:00, 122.92it/s]

Patient number 896 has a high prediction since he has positive diagnoses of high blood pressure, diabetes and asthma, and has limitations in physical as well as work/house/school functioning. He also has a rather low physical summary score.

4

Patients 896 and 975 are good examples. Changing the value of HIBPDX (high blood pressure diagnosis), ASTHDX (asthma diagnosis) or WLKLIM31 (limitation in physical functioning) lowers the prediction for patient 896 and increases it for patient 975. The result makes intuitive sense: the more positive diagnoses, the higher the prediction.
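
The idea behind these profiles can be sketched in a few lines: change one variable over a grid of values, keep every other variable fixed, and record the prediction. The code below is a simplified illustration, not the dalex implementation used in the notebook. The stand-in model `toy_predict`, its coefficients, and the reading of code 1 as a positive diagnosis are all assumptions.

```python
def ceteris_paribus(predict, observation, feature, grid):
    """Return (value, prediction) pairs with only `feature` changed."""
    profile = []
    for value in grid:
        modified = dict(observation)   # copy so the original stays intact
        modified[feature] = value
        profile.append((value, predict(modified)))
    return profile

# Hypothetical stand-in model: each positive diagnosis (code 1) raises the score.
def toy_predict(row):
    return 2.0 + 1.5 * (row["HIBPDX"] == 1) + 1.0 * (row["ASTHDX"] == 1)

# A made-up patient without positive diagnoses (code 2 assumed to mean "no").
patient = {"HIBPDX": 2, "ASTHDX": 2}
profile = ceteris_paribus(toy_predict, patient, "HIBPDX", [1, 2])
print(profile)
```

For a patient who already has the diagnosis, the same profile read from the observed value runs in the opposite direction, which matches the contrast between patients 896 and 975.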

Comparing CP explanations for XGB with Lasso Regression

Creating explainer for linear model
Preparation of a new explainer is initiated

  -> data              : 14680 rows 44 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 14680 values
  -> model_class       : sklearn.linear_model._coordinate_descent.Lasso (default)
  -> label             : MEPS
  -> predict function  : <function yhat_default at 0x7f9148a5e670> will be used (default)
  -> predicted values  : min = 3.251906166715444, mean = 5.710264253185743, max = 9.922703669439233
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -9.922703669439233, mean = -1.8586403987185183e-16, max = 6.282237531056754
  -> model_info        : package sklearn

A new explainer has been created!
Patien no:  704  , prediction value:  8.046038  true value:  [8.04268481] 


Calculating ceteris paribus!: 100%|██████████| 44/44 [00:00<00:00, 104.69it/s]

XGB explanation

Patient 704 is a 59-year-old female. Her cost is predicted to be high because of her age and her not being a student (the two are probably correlated), a high blood pressure diagnosis (HIBPDX), coronary heart disease (CHDDX) together with other heart disease (OHRTDX), an arthritis diagnosis, limitations in physical functioning (WLKLIM31) and social limitations (SOCLIM31). Not being diagnosed with high cholesterol (CHOLDX) lowers the predicted cost roughly threefold. Not having a cancer or diabetes diagnosis also lowers the cost.

Patien no:  704  , prediction value:  7.507052624767924  true value:  [8.04268481] 


Calculating ceteris paribus!: 100%|██████████| 44/44 [00:00<00:00, 195.91it/s]

Lasso explanation

The same patient as above (704). The Lasso model predicted lower health expenses for her: 7.51 versus 8.05 on the log scale, a difference of about 0.54. The linear model put little weight on the categorical variables. Its prediction was driven mainly by the patient's age and overall ratings of feelings.
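
Since the target is a base 3 logarithm, a gap between two predictions corresponds to a multiplicative factor in original units. The short calculation below uses the two predictions printed above for patient 704; the variable names are illustrative, and any shift applied before the logarithm is ignored, which is an approximation.

```python
xgb_pred = 8.046038             # XGB prediction for patient 704 (log base 3 scale)
lasso_pred = 7.507052624767924  # Lasso prediction for the same patient

log_gap = xgb_pred - lasso_pred  # difference on the log scale
ratio = 3 ** log_gap             # corresponding factor in original units
print(round(log_gap, 3), round(ratio, 2))
```

The factor comes out at roughly 1.8, so the Lasso model's predicted cost is a little over half of the XGB one.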

5

In this case the CP profiles of the more complex XGB model are easier to interpret than those of the Lasso model, and its explanation makes more intuitive sense.